14 research outputs found
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research efforts seeking to automatically process facsimiles and extract information from them are multiplying, with document layout analysis as a first essential step. While the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain, among them the use of finer-grained segmentation typologies and the handling of complex, heterogeneous documents such as historical newspapers. Moreover, most approaches consider visual features only, ignoring the textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Through a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among other questions, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show a consistent improvement of multimodal models over a strong visual baseline, as well as better robustness to high material variance.
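The fusion of modalities described above can be sketched as an early, channel-wise concatenation of the page image with a spatial map of text embeddings, so that a segmentation network sees both signals at once. This is a minimal illustrative sketch only; the array shapes, embedding dimension, and the `fuse_features` helper are assumptions, not the authors' implementation.

```python
import numpy as np

def fuse_features(image: np.ndarray, text_map: np.ndarray) -> np.ndarray:
    """Concatenate an H x W x 3 page image with an H x W x D map of word
    embeddings along the channel axis, yielding one multimodal input."""
    assert image.shape[:2] == text_map.shape[:2], "spatial grids must align"
    return np.concatenate([image, text_map], axis=-1)

# Toy example: a 4 x 4 page crop with 8-dimensional text embeddings.
image = np.random.rand(4, 4, 3)
text_map = np.random.rand(4, 4, 8)
fused = fuse_features(image, text_map)
print(fused.shape)  # (4, 4, 11)
```

Early fusion of this kind keeps the pixel-to-token alignment intact, which is what lets the segmentation network exploit textual cues at each spatial location.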
Une approche computationnelle du cadastre napoléonien de Venise
At the beginning of the 19th century, the Napoleonic administration imposed on the city of Venice a new standardised description system intended to give an objective account of the form and functions of the urban fabric. The cadastre, deployed on a European scale, offered for the first time an articulated and precise view of the structure of the city and its activities, through a methodical approach and standardised categories. Digital techniques, based in particular on deep learning, now make it possible to extract from these documents an accurate and dense representation of the city and its inhabitants.
By systematically checking the consistency of the extracted information, these techniques also evaluate the precision and systematicity of the work of the Empire's surveyors and assessors, and therefore indirectly qualify the trust to be placed in the extracted information. This article reviews the history of this computational protosystem and describes how digital techniques offer not only systematic documentation, but also prospects for extracting latent information, not yet made explicit but implicitly present in this information system of the past.
Historical newspaper semantic segmentation using visual and textual features
Mass digitization and the opening of digital libraries have given access to a huge amount of historical newspapers. In order to bring structure into these documents, current techniques generally proceed in two distinct steps: they first segment the digitized images into generic articles and then classify the text of the articles into finer-grained categories. Unfortunately, by losing the link between layout and text, these two steps cannot account for the fact that newspaper content items have distinctive visual features. This project proposes two main novelties. Firstly, it introduces the idea of merging the segmentation and classification steps, resulting in a fine-grained semantic segmentation of newspaper images. Secondly, it proposes to use textual features, in the form of embedding maps, at the segmentation step. The semantic segmentation with four categories (feuilleton, weather forecast, obituary, and stock exchange table) is performed with a fully convolutional neural network and reaches a mIoU of 79.3%. The introduction of embedding maps improves overall performance by 3% and the generalization across time and newspapers by 8% and 12%, respectively. This shows a strong potential for considering the semantic aspect in the segmentation of newspapers and for using textual features to improve generalization.
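The mIoU figure reported above is computed by taking, for each class, the intersection over union of the predicted and ground-truth label masks, then averaging across classes. A minimal sketch, with toy label maps invented purely for illustration:

```python
import numpy as np

def mean_iou(pred: np.ndarray, truth: np.ndarray, n_classes: int) -> float:
    """Mean intersection-over-union, averaged over classes that occur
    in either the prediction or the ground truth."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, truth == c).sum()
        union = np.logical_or(pred == c, truth == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2 x 4 label maps with two classes (0 = background, 1 = feuilleton).
truth = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1]])
pred  = np.array([[0, 0, 1, 0],
                  [0, 1, 1, 1]])
print(mean_iou(pred, truth, n_classes=2))  # 0.6
```

Averaging per-class IoU rather than per-pixel accuracy prevents large background regions from dominating the score, which matters for sparse categories such as weather forecasts or obituaries.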
Language Resources for Historical Newspapers: the Impresso Collection
Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge, and the real promise of digitization, is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this 'Big Data of the Past'. Yet, the application of text processing tools to historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the 'impresso - Media Monitoring of the Past' project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.
Datasets and Models for Historical Newspaper Article Segmentation
Dataset and models used and produced in the work described in the paper "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers": https://infoscience.epfl.ch/record/282863?ln=e
Repopulating Paris: massive extraction of 4 Million addresses from city directories between 1839 and 1922
In 1839, in Paris, the Maison Didot bought the Bottin company. Sébastien Bottin, trained as a statistician, was the initiator of a high-impact yearly publication, called 'Almanachs', containing the listing of residents, businesses and institutions, arranged geographically, alphabetically and by activity typologies (Fig. 1). These regular publications enjoyed great success. In 1820, the Parisian Bottin Almanach contained more than 50,000 addresses, and until the end of the 20th century the word 'Bottin' was the colloquial term for a city directory in France. The publication of the 'Didot-Bottin' continued at an annual rhythm, mapping the evolution of the active population of Paris and other cities in France. The relevance of automatically mining city directories for historical reconstruction has already been argued by several authors (e.g. Osborne, Hamilton and Macdonald 2014, or Berenbaum et al. 2016). This article reports on the extraction and analysis of the data contained in the 'Didot-Bottin' covering the period 1839-1922 for Paris, digitized by the Bibliothèque nationale de France. We process more than 27,500 pages to create a database of 4.2 million entries linking addresses, person mentions and activities.
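Linking addresses, person mentions and activities amounts to splitting each directory record into its constituent fields. The sketch below illustrates the idea on an invented, already-transcribed entry; the record format, the regular expression and the `parse_entry` helper are illustrative assumptions, not the project's actual extraction pipeline, which operates on digitized page images.

```python
import re

# Hypothetical normalised form of a Bottin line: "name, activity, street, number".
ENTRY = re.compile(
    r"^(?P<name>[^,]+),\s*(?P<activity>[^,]+),\s*(?P<street>[^,]+),\s*(?P<number>\d+)\.?$"
)

def parse_entry(line: str) -> dict:
    """Split one directory line into name / activity / address fields."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unparsable entry: {line!r}")
    return match.groupdict()

record = parse_entry("Dupont (Jean), serrurier, r. de la Roquette, 12.")
print(record["activity"])  # serrurier
```

Structuring entries this way is what makes it possible to aggregate millions of records by street, trade or person across the 1839-1922 run.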
Repopulating Paris: massive extraction of 4 Million addresses from city directories between 1839 and 1922.
Abstract of paper 0878 presented at the Digital Humanities Conference 2019 (DH2019), Utrecht, the Netherlands, 9-12 July 2019.